SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics
Transformer-based models, such as BERT and ViT, have achieved
state-of-the-art results across different natural language processing (NLP) and
computer vision (CV) tasks. However, these models are extremely memory
intensive during their fine-tuning process, making them difficult to deploy on
GPUs with limited memory resources. To address this issue, we introduce a new
tool called SlimFit that reduces the memory requirements of these models by
dynamically analyzing their training dynamics and freezing less-contributory
layers during fine-tuning. The layers to freeze are chosen using a runtime
inter-layer scheduling algorithm. SlimFit adopts quantization and pruning for
particular layers to balance the load of dynamic activations and to minimize
the memory footprint of static activations, where static activations refer to
those that cannot be discarded regardless of freezing. This allows SlimFit to
freeze up to 95% of layers and reduce the overall on-device GPU memory usage of
transformer-based models such as ViT and BERT by an average of 2.2x, across
different NLP and CV benchmarks/datasets such as GLUE, SQuAD 2.0, CIFAR-10,
CIFAR-100, and ImageNet, with an average accuracy degradation of 0.2%. For
such NLP and CV tasks, SlimFit can reduce total on-device memory usage by up
to 3.1x with an accuracy degradation of at most 0.4%. As a result, while
fine-tuning ViT on ImageNet and BERT on SQuAD 2.0 with a batch size of 128
requires three and two 32 GB GPUs respectively, SlimFit enables their
fine-tuning on a single 32 GB GPU without any significant accuracy
degradation.
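To illustrate the core idea of freezing low-contribution layers during fine-tuning, below is a minimal PyTorch sketch. The per-layer gradient-norm metric and the fixed freeze fraction are illustrative assumptions; SlimFit's actual runtime inter-layer scheduling algorithm and its quantization and pruning of activations are not reproduced here.

    # Hedged sketch: rank layers by gradient norm and freeze the weakest ones.
    # The metric and freeze fraction are illustrative stand-ins for SlimFit's
    # training-dynamics analysis and runtime inter-layer scheduler.
    import torch.nn as nn

    def layer_grad_norms(model: nn.Module) -> dict:
        """L2 gradient norm accumulated over each top-level child module."""
        norms = {}
        for name, module in model.named_children():
            total = 0.0
            for p in module.parameters():
                if p.grad is not None:
                    total += p.grad.norm().item() ** 2
            norms[name] = total ** 0.5
        return norms

    def freeze_low_contribution_layers(model: nn.Module, fraction: float = 0.5) -> None:
        """Freeze the `fraction` of child modules with the smallest gradient norms."""
        norms = layer_grad_norms(model)
        n_freeze = int(len(norms) * fraction)
        for name, _ in sorted(norms.items(), key=lambda kv: kv[1])[:n_freeze]:
            for p in getattr(model, name).parameters():
                p.requires_grad_(False)  # no weight gradients for frozen layers

Called after a backward pass, a routine like this mimics how freezing removes the need to keep those layers' activations around for gradient computation, which is where the memory savings come from.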
Distributed Memory, GPU Accelerated Fock Construction for Hybrid, Gaussian Basis Density Functional Theory
With the growing reliance of modern supercomputers on accelerator-based
architectures such as GPUs, the development and optimization of electronic
structure methods to exploit these massively parallel resources has become a
recent priority. While significant strides have been made in the development of
GPU-accelerated, distributed memory algorithms for many-body (e.g.,
coupled-cluster) and spectral single-body (e.g., planewave, real-space, and
finite-element density functional theory [DFT]) methods, the vast majority of
GPU-accelerated Gaussian atomic orbital methods have focused on shared memory
systems, with only a handful of examples pursuing massive parallelism on
distributed memory GPU architectures. In this work, we present a set of
distributed memory algorithms for the evaluation of the Coulomb and
exact-exchange matrices for hybrid Kohn-Sham DFT with Gaussian basis sets via
direct density-fitted (DF-J-Engine) and seminumerical (sn-K) methods,
respectively. The absolute performance and strong scalability of the developed
methods are demonstrated on systems ranging from a few hundred to over one
thousand atoms using up to 128 NVIDIA A100 GPUs on the Perlmutter
supercomputer.

Comment: 45 pages, 9 figures
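As a rough, self-contained illustration of the density-fitted Coulomb (DF-J) contraction that the paper distributes across GPUs, the NumPy sketch below forms J = (mu nu|P) (P|Q)^{-1} (Q|lambda sigma) D_{lambda sigma} from randomly generated placeholder tensors. The dimensions and integrals are fabricated for shape-checking only; neither the DF-J-Engine nor the sn-K exchange kernel is reproduced here.

    # Hedged sketch of a density-fitted Coulomb (DF-J) build with fake tensors.
    # Real codes compute (mu nu|P) and (P|Q) from Gaussian basis integrals and
    # distribute blocks of the contraction across ranks/GPUs.
    import numpy as np

    nbf, naux = 32, 64  # placeholder orbital / auxiliary basis sizes
    rng = np.random.default_rng(0)

    eri3 = rng.standard_normal((nbf, nbf, naux))
    eri3 = 0.5 * (eri3 + eri3.transpose(1, 0, 2))     # symmetrize in mu, nu
    metric = rng.standard_normal((naux, naux))
    metric = metric @ metric.T + naux * np.eye(naux)  # SPD stand-in for (P|Q)
    D = rng.standard_normal((nbf, nbf))
    D = 0.5 * (D + D.T)                               # symmetric density matrix

    # gamma_Q = sum_{lambda sigma} (Q|lambda sigma) D_{lambda sigma}
    gamma = np.einsum("lsQ,ls->Q", eri3, D)
    # Solve (P|Q) c_Q = gamma_P rather than forming the inverse explicitly.
    c = np.linalg.solve(metric, gamma)
    # J_{mu nu} = sum_P (mu nu|P) c_P
    J = np.einsum("mnP,P->mn", eri3, c)

In a distributed-memory setting, each rank would hold a batch of shell-pair blocks of the three-center tensor, compute its partial gamma and J contributions on its local GPU, and combine them with all-reduce operations.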